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Abstract 

1 — | High-dimensional tensors or multi-way data are becoming prevalent in areas such 

as biomedical imaging, chemometrics, networking and bibliometrics. Traditional ap- 

l—l 

proaches to finding lower dimensional representations of tensor data include flattening 
the data and applying matrix factorizations such as principal components analysis 

^— > 

(PCA) or employing tensor decompositions such as the CANDECOMP / PARAFAC 
(CP) and Tucker decompositions. The former can lose important structure in the 
^5 data, while the latter Higher-Order PCA (HOPCA) methods can be problematic in 

high-dimensions with many irrelevant features. We introduce frameworks for sparse 
tensor factorizations or Sparse HOPCA based on heuristic algorithmic approaches and 
i— H by solving penalized optimization problems related to the CP decomposition. Exten- 

. !_h sions of these approaches lead to methods for general regularized tensor factorizations, 

><! 

multi-way Functional HOPCA and generalizations of HOPCA for structured data. We 
illustrate the utility of our methods for dimension reduction, feature selection, and 
signal recovery on simulated data and multi-dimensional microarrays and functional 
MRIs. 
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1 Introduction 



Recently high-dimensional tensor data has become prevalent in areas of biomedical imaging, 
remote sensing, biblio metrics, chemometrics and the Internet. We define high-dimensional 
tensor data to be multi-dimensional data, with three or more dimensions or tensor modes, 
in which the dimension of one or more modes is large compared to the remainder. Examples 
of this include functional magnetic resonance imaging (fMRI) in which three-dimensional 
images of the brain measured in voxels (when vectorized, these form mode 1) are taken over 
time (mode 2) for several experimental conditions (mode 3) and for a set of independent 
subjects (mode 4). Here, the "samples" in this data set are the independent subjects (~ 
30) or experimental tasks (~ 20), the dimension of which are small compared to the number 
of voxels (~ 60,000) and time points (~ 2,000). Exploratory analysis, including dimension 
reduction, pattern recognition, visualization, and feature selection, for this massive tensor 
data is a major mathematical and computational challenge. In this paper, we propose 
flexible unsupervised multivariate methods, specifically regularized Higher-Order Principal 
Components Analysis (HOPCA), to understand and explore high-dimensional tensor data. 

Classically, multivariate methods such as principal components analysis (PCA) have been 
applied to matrix data in which the number of independent observations n is larger than the 
number of variables p. Exploring tensor data using these methods would require flattening 
the tensor into a matrix with one dimension enormous compared to the other. This approach 
is not ideal as to begin with, the important mult i- dimensional structure of the tensor is lost. 
To solve this, many use traditional tensor decompositions such as the CANDECOMP / 



PARAFAC (CP) (Harshman 1970; Carroll and Chang 1970) and Tucker decompositions 



(Tucker 1966) to compute HOPCA (Kolda and Bader 2009). These factorizations seek to 
approximate the tensor by a low-rank set of factors along each of the tensor modes. Unlike 
the singular value decomposition (SVD) for matrix data, these tensor decompositions do 



not uniquely decompose the data (Kolda and Bader, 2009) further complicating the analysis 
of tensor data. Especially in the statistics community, there has been relatively little work 
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done on methodology for high-dimensional tensor data (McCullagh, 1987 Kroonenberg 



2008 Hoff 2011) 



Many, especially in applied mathematics, have studied tensor decompositions, but rela- 
tively few have advocated encouraging sparsity or other types of regularization in the tensor 
factors. The exception is the non-negative tensor factorization, which is a well developed 



area and has been used to cluster tensor data (Cichocki et al. , 2009). In the context of these 



non-negative factorizations, several have discussed sparsity (Hazan et al. , 2005 M0rup et al. 



2008 Lim and Comon, 2009; Cichocki et al., 2009 Liu et al., 2012; Chi and Kolda 2011), 



and a few have gone on to explicitly encourage sparsity in one of the tensor factors (Ruiters 



and Klein 


2009 


Pang et al. 


2011) 



flexible regularization in tensor factorizations and HOPCA is lacking. 

Sparsity in tensor decompositions and HOPCA is desirable for many reasons. First, 



tensor decompositions are often used to compress large multi-dimensional data sets (Kolda 



and Bader, 2009). Sparsity in the tensor factors compresses the data further and is attractive 
from a data storage perspective. Second, in high-dimensional settings, many features are 
often irrelevant. With fMRIs, for example, there are often hundreds of thousands of voxels 
in each image with many voxels inactive for much of the scanning session. Sparsity gives one 
an automatic tool for feature selection. Third, many have noted that PCA is asymptotically 



inconsistent in high-dimensional settings (Johnstone and Lu, 2009 Jung and Marron, 2009). 



As this is true for matrix data, it is not hard to surmise that asymptotic inconsistency of 
the factors holds for HOPCA as well. Sparsity in PCA, however, has been shown to yield 



consistent PC directions (Johnstone and Lu, 2009 Amini and Wainwright, 2009). Finally 



for high-dimensional tensors, visualizing and interpreting the HOPCs can be a challenge. 
Sparsity limits the number of features and hence simplifies visualization and interpretation 
of exploratory data analysis results. Similar arguments can be made for smoothing the tensor 
factors or encouraging other types of regularization. 

In this paper, we seek a mathematically sound and computationally attractive framework 
for regularizing one, two, or all of the tensor factors, or HOPCs. Our approach is based on 



optimization theory. That is, we seek to find a coherent optimization criterion for regular- 
ized HOPCA related to traditional tensor decomposition approaches and an algorithm that 
converges to a solution to this optimization problem. Recently, many have proposed similar 



optimization-based approaches to perform regularized PCA. Beginning with Jolliffe et al. 



(2003), several have proposed to find Sparse PCs by solving an optimization problem related 



the PCA with an additional fj-norm penalty (Zou et al. , 2006 Shen and Huang, 2008). 



Others have shown that Functional PCA can be achieved by solving a PCA optimization 



problem with a penalty to induce smoothness (Silverman, 1996 Huang et al., 2008). More 



recently, several have extended these optimization approaches to two-way penalized SVDs 



by placing penalties on both the row and column factors simultaneously (Witten et al. , 2009 



Huang et al. , 


2009 


Lee et al. 


2010 


Allen et al. 


2011) 



ized HOPCA most closely follows the work of Shen and Huang (2008) and later Witten et al. 



(2009) and Allen et al. (2011) who iteratively solve penalized regression problems converging 
to a rank-one solution with subsequent factors found via a greedy deflation method. 

To limit the scope of our exploration, we first develop a framework for Sparse HOPCA and 
later show that extensions of this framework yield methods for regularization with general 
penalties, Functional HOPCA, and Generalized HOPCA for structured tensor data. Addi- 
tionally, as the concept of regularization in HOPCA is relatively undeveloped, we employ 
a less common strategy by first introducing several flawed approaches to Sparse HOPCA. 
We do this for a number of reasons, namely to illuminate these problematic paths for fu- 
ture researchers, to give us a basis for comparison for our optimization-based methods, and 
to provide further background and justification for our approach. Our optimization-based 
approach directly solves a multi-way concave relaxation of a rank-one HOPCA method. 
Overall, this method has many mathematical and computational advantages including guar- 
anteed convergence to a local optimum. 

This paper is organized as follows. As the general statistical audience may be unfamiliar 
with tensors and tensor decompositions for HOPCA, we begin by introducing notation, 
reviewing these decompositions, and discussing these in the context of PCA in Section [2] 



We take a first step at considering how to introduce sparsity in this context in Section [3] by 
walking the reader through several novel (but, flawed!) algorithmic approaches to Sparse 
HOPCA. Then, in Section |4| we introduce our main novel method and algorithm for Sparse 
HOPCA, based on deflation and the CP factorization. In Section [5j we outline several novel 
extensions of this approach to allow flexible modeling with general penalties and functional 
or structured tensor data. Simulation studies are presented in Section [6] followed by two case 
studies to high-dimensional three-way microarray data and an fMRI study. We conclude 
with a discussion, Section [7| of the implications of our work as well as the many directions 
for future work on tensor data. 



2 Preliminaries: Tensor Decompositions 

In this section, we review the two major approaches to tensor decompositions and HOPCA, 
the CANDECOMP / PARAFAC (CP) and Tucker decompositions. We also introduce no- 
tation and discuss considerations for evaluating HOPCA such as the amount of variance 
explained. 

As notation with multi-way data can be cumbersome, we begin by reviewing the notation 



used throughout this paper, mostly adopted from Kolda and Bader (2009). Tensors will be 
denoted as X, matrices as X, vectors as x and scalars as x. As there are many types of 
multiplication with tensors, the outer product will be denoted by o, xoy = xy T . Specific 
dimensions of the tensor will be called modes and multiplication by a matrix or vector 
along a tensor mode will be denoted as Xi; here, the subscript refers to the mode being 
multiplied using regular matrix multiplication. For example, if X G 5R" xpXl1 and U G $t nxK , 
then X XiU G $l Kxpxq . Sometimes it is necessary to flatten the tensor into a matrix, or 
matricize the tensor. This is denoted as X(i) where the subscript indicates the mode along 
which the matricization has occurred. For example, if X G §l nx P x i y then Xm G $t. nxpq . 



The tensor Frobenius norm, refers to = \/^2i^2j^2k^ijk- Two typ 



)CS 



of matrix multiplication that will be important in this paper are the Kronecker product 
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and the perhaps less familiar Khatri-Rao product. The former is denoted by <8> and the 
latter by 0. For matrices U £ ^t nxK and V £ W x , the Khatri-Rao product is defined 
as UoV = [ui<g>Vi ... Ujf <8) Vj<-] . For notational simplicity, all results in this paper 
will be presented for the three-mode tensor. Our methods can all be trivially extended to 
multi-dimensional tensors with an arbitrary number of modes. 

As discussed in the introduction, there is no equivalent to the SVD for multi-way data 
with three or more dimensions. Thus, for tensor data, people commonly use one of two 
tensor decompositions that capture desirable aspects of the SVD. The CP decomposition 
seeks to model a tensor as a sum of rank one tensors, where rank one three-mode tensors 
are defined as an outer product of three vectors: X ps Y^=i dk u& o o w^, where £ !ft n , 



v& £ 9ft p , Wfc £ ffl and > (Harshman, 1970 Carroll and Chang, 1970). In this paper 



we will assume that the three vectors have norm one, i.e. u T u = 1, and define the factor 
matrices as U £ W lxK = [u% ... Uk] and so forth. Thus, similar to the SVD, the CP 
seeks to factorize the data into a sum of rank-one arrays, although unlike the SVD these 
rank-one arrays do not uniquely decompose the data and in general, are not orthonormal. 
The Tucker decomposition, sometimes referred to as the Higher-Order SVD (HOSVD) or 
HOPCA, seeks to model a three-mode tensor asAf^X>x 1 Ux 2 V x 3 W where the factors 
U £ K nxA \ V £ W xl<2 and W £ $t qxKs are orthonormal and T> £ ^ xK ^ xK ^ is the 



core tensor (Tucker, 1966; De Lathauwer et al. , 2000). Hence, the factors of the Tucker 



decomposition are orthonormal, similar to those of the SVD, but these orthonormal factors 
do not uniquely decompose the array into a sum of rank-one factors with a diagonal core. 

Both the CP and Tucker decompositions can be used for tensor data in a similar manner 
to PCA for matrix data. In other words, one can think of the tensor factors as major modes 
of variation or patterns in the data that can be used for exploratory analysis, visualization, 
and dimension reduction. A critical quantity in assessing the dimension reduction achieved 
for PCA is the amount of variance explained by each of the SVD factors. As existing tensor 
factorizations and the methods we will propose in this paper do not uniquely decompose the 
data array, we cannot simply sum the equivalent of the squared singular values to measure the 
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variance explained. Since the Tucker decomposition imposes orthonormality on the factors, 
the proportion of variance explained by the decomposition has a simple form: 



(De Lathauwer et al. 2000 Kolda and Bader 2009). This is not the case for the CP 
decomposition where many have erroneously referred the the variance explained by the k th 
component as d\/\ \ X \ \ F . Instead, as the factors of the CP decomposition may be correlated, 
we must project out these effects to determine the proportion of variance explained: 

Theorem 1. Define = Ufc(U^ U^) -1 where U k = [ui,...Ufc] and define Pjj^ 

and analogously. Then, the cumulative proportion of variance explained by the first k 

HOPCs (or regularized HOPCs) is given by \\X XiPf ) x 2 Pi y) x 3 pf° \\ 2 F / \\X\\ 2 F . 

As our regularized tensor decompositions presented in the next two sections will not im- 
pose orthogonality on the factors, this result is critical in evaluating the dimension reduction 
achieved by both the CP decomposition and our regularized HOPCA methods. 

3 Algorithmic Approaches to Sparse HOPCA 

We begin our exploration of methods for finding Sparse Higher-Order Principal Components 
by proposing three algorithmic approaches. These methods are direct extensions of popular 
algorithms for the CP and Tucker decompositions that one might first think of when seeking 
sparsity. We will show, however, that while these approaches are intuitively simple, they 
are problematic both mathematically and computationally as they do not solve a coherent 
mathematical objective. Despite this, these methods are important to present to both steer 
future investigators from problematic paths and provide background, rationale, and a basis 
for comparison to our subsequent optimization-based Sparse HOPCA methods. Here, we 
walk through each of these algorithmic approaches; outlines of the full algorithms are given 
in the Supplemental Materials. 

The Higher-Order SVD (HOSVD) and Higher-Order Orthogonal Iteration (HOOI), some- 
times referred to as the Tucker Alternating Least Squares (Tucker-ALS), are two popular 



algorithms for computing the Tucker decomposition (Tucker, 1966; De Lathauwer et al., 2000 



Kolda and Bader 2009 ). The former estimates each factor matrix by calculating the singular 



vectors of the tensor matricized along each mode (Tucker 1966 De Lathauwer et al. 2000). 
In other words for a three-mode tensor, the HOSVD can be found by performing PCA three 
times on data flattened along each of the three dimensions. The HOOI seeks to improve 
upon this decomposition by minimizing a Frobenius norm loss between X and the Tucker 
decomposition T> Xi\J T x 2 V T x 3 W T subject to orthogonality constraints on the factors. 



De Lathauwer et al. (2000) showed that this problem is solved by an iterative algorithm 



where each Tucker factor is estimated by computing the singular vector of the tensor pro- 
jected onto the other factors. In other words, the HOOI updates U by performing PCA on 
the matricized (X x 2 V x 3 W)(i). Thus, both the HOSVD and HOOI algorithms estimate 
the factors by performing PCA on a matricized tensor. This leads to a simple strategy for 
obtaining Sparse HOSVD and Sparse HOOI: Replace PCA with Sparse PCA in each algo- 
rithm to obtain sparse factors for each tensor mode. Many algorithms exist for performing 



Sparse PCA (Jolliffe et al. 2003 Zou et al, 2006 Shen and Huang, 2008 Johnstone and 



Lu, 2009 Journee et al. , 2010), any of which can be used to compute the Sparse HOSVD or 



Sparse HOOI. 

While these methods for Sparse HOPCA based on the Tucker decomposition are concep- 
tually simple, they are not ideal for several reasons. First, the methods are purely heuristic 
algorithmic approaches and do not solve any coherent mathematical optimization problem. 
For the Sparse HOOI method, this is particularly problematic as one cannot guarantee 
convergence of the algorithm without additional constraints. Secondly in high-dimensional 
settings, matricizing the tensor along each mode and performing Sparse PCA is computa- 
tionally intensive and requires large amounts of computer memory. Employing the Sparse 



PCA methods of Jolliffe et al. (2003); Zou et al. (2006) requires forming and computing the 
leading sparse eigenvalues of n x n, p x p, and q x q matrices, which in high-dimensional 



settings are typically much larger than the original data array. The methods of Shen and 



Huang (2008); Journee et al. (2010) require computing the sparse singular vectors of n x pq, 



p x nq, and q x np, which corresponds to calculating several unnecessary and extremely large 



singular vectors. Hence, even though the Sparse HOSVD and HOOI are attractive in their 
simplicity, mathematically and computationally they are less desirable. 

Next, we consider incorporating sparsity in the CP decomposition through the popular 
CP Alternating Least Squares (CP-ALS) Algorithm. This algorithm updates one factor at 



a time by minimizing a Frobenius norm loss with the other factors fixed (Harshman, 1970 



Carroll and Chang, 1970; Kolda and Bader, 2009). Consider, for example, solving for U with 



V and W fixed. Then, the loss function can be written as 1 1 X — ^2 k=1 dk u& o v k o w/s 



X(i) -U(V© W) T || F , where U = Udiag(d) (Harshman, 1970; Carroll and Chang, 1970). 



Thus, the CP-ALS algorithm estimates each factor sequentially by performing simple least 
squares and normalizing the columns of the factor matrix to have norm one, i.e. dk = ||ufc||. 
A simple strategy for encouraging sparsity is then to replace the least squares problem with 



an £i-norm penalized least squares, or LASSO problem (Tibshirani, 1996). In other words 



to estimate a sparse U, we minimize || Xm — U(V© W) T ||^ + A u ||U||i, where A u is a non- 
negative regularization parameter and 1 1 • | |i = ^ ^ \ ■ | is the ^-norm. Hence, a possible 
method for Sparse HOPCA is to replace each least squares update in the CP-ALS algorithm 
with a LASSO update; we call this the Sparse CP-ALS method. 

A first glance, it seems that the Sparse CP-ALS is less heuristic than the Sparse HOSVD 
and HOOI methods. As each update solves a LASSO regression problem, it is natural to 
presume that the algorithm jointly minimizes the Frobenius norm loss with l\ penalties on 
each of the factors: 



minimize 
u.v.w 2 



1 K 

-\\ X - 2j4u fc o v fc o w fc III + A u || U ||i + A v || V ||i + A w || W ||i 



k=l 



subject to dk > 0, = 1, = 1, & = 1, V k = 1, . . . K. 



(1) 



Interestingly, this is not the case! 



Proposition 1. The Sparse CP-ALS algorithm does not minimize ([!]). 



In fact, it can be shown that the Sparse CP-ALS does not optimize any coherent objective. 



Similar results, while not proved explicitly, have been implied for two-way penalized SVDs 



(Witten et al. 


2009 


Lee et al. 


2010 


Allen et al. 


2011 



and HOOI methods, convergence of the Sparse CP-ALS algorithm cannot be guaranteed 
without further constraints. In the next section, we present a framework for incorporating 
sparsity in the CP decomposition with provable results in terms of convergence to a solution 
of an optimization problem. 



4 Deflation Approach to Sparse HOPCA 

In this section, we develop a novel deflation approach to Sparse HOPCA that find each 
rank-one factorization by solving a tri-concave relaxation of the CP optimization problem. 
But first, we show how the CP decomposition can be obtained by an alternative greedy 
algorithm, the deflation-based Tensor Power Algorithm. 



4.1 Tensor Power Algorithm 



We introduce an alternative form of the rank-one CP optimization problem and a correspond- 
ing algorithm that will form the foundation of our deflation approach to Sparse HOPCA. 



The single-factor CP decomposition solves the following optimization problem (Kolda and 



Bader 2009): 



minimize || X — d\x o v ow subject to u T u = 1, v T v = 1, w T w = 1, & d > 0. (2) 

u,v,w,d 



Some algebra manipulation (see Kolda and Bader (2009)) shows that an equivalent form to 



this optimization problem is given by the following: 



maximize Afx 1 ux 2 vx 3 w subject to u 1 u = 1, v 1 v = 1, & w T w = 1. (3) 



As ([3]) is separable in the factors, we can optimize this in an iterative block-wise manner: 
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Proposition 2. The block coordinate-wise solutions for ^ are given by: 



u 



X X2 V X3 W 
X Xo V X q W I 



X Xi U X3 W 

A'xiux^wl 



& w 



A' Xx U X 2 V 
A' Xi U Xo V I 



Updated iteratively, these converge to a local optimum of 

As each coordinate update increases the objective and the objective is bounded above by 
d, convergence of this scheme is assured. Note, however, that this approach only converges 
to a local optimum of ([3]), but this is true of all other algorithmic approaches to solving the 



CP problem as well (Kolda and Bader, 2009) 



To compute multiple CP factors, one could apply this single-factor approach sequentially 
to the residuals remaining after subtracting out the previously computed factors. This so 
called deflation approach is closely related in structure to the power method for computing 



eigenvectors ( Golub and Van Loan , 1996 ) . We then call this greedy method the Tensor Power 



Algorithm. Notice that this method does not enforce orthogonality in subsequently computed 
components. The algorithm can be easily modified, however, to ensure orthogonality by 



employing a Graham-Schmidt scheme (Golub and Van Loan, 1996) 



Before moving on to our Sparse CP method, we pause to discuss the Tensor Power 
Method and compare it to more common algorithms to compute the CP decomposition such 



as the Alternating Least Squares algorithm (Harshman, 1970 Carroll and Chang, 1970). As 



the Tensor Power Algorithm is a greedy approach, the first several factors computed will 
tend to explain the most variance in the data. In contrast, the CP-ALS algorithm seeks 
to maximize the set of dk for all K factors simultaneously. Thus, the set of K CP-ALS 
factors may explain as much variance as those of the Tensor Power Algorithm, but the first 
several factors typically explain much less. This is illustrated on the tensor microarray data 
in Section [6] Also, while the CP-ALS algorithm returns dk in descending order, the dk 
computed via the Tensor Power Method are not necessarily ordered. 
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4.2 Sparse HOPCA via Deflation 

We introduce a novel deflation-based method for Sparse HOPCA that incorporates sparsity 
by regularizing factors with an ^-norm penalty. Our method solves a direct relaxation of 
the CP optimization problem ^ and has a computationally attractive solution. 

Many have sought to encourage sparsity in PCA by greedily solving £i-norm penalized 



optimization problems related the to rank-one SVD (Jolliffe et al. 2003 Zou et al., 2006 



Shen and Huangj [2008| |Witten et aLj [2009| |Lee et al.[ [20TU| |Journee et aL| [20"T0l [AHen 



et al. , 2011). We formulate our method similarly by placing £i-norm penalties on each of 



the tensor factors of the CP optimization problem ^ and relaxing the equality constraints 
to inequalities. In this manner, our approach is most similar to the two-way Sparse PCA 



methods of Witten et al. (2009) and Allen et al. (2011). We define our single-factor Sparse 



CP-TPA (named to denote its relation to our Tensor Power Algorithm) factorization as the 
solution to the following problem: 



maximize X X1UX2VX3W — A u ||u||i — A v 1 1 v 1 1 1 — A w ||w||i 

u,v,w 

subject to u T u < l,v T v < 1, & w T w < 1. 



(4) 



Here A u , A v and A w are non-negative regularization parameters controlling the amount of 
sparsity in the tensor factors. Relaxing the constraints greatly simplifies the optimization 
problem, leading to a simple solution and algorithmic approach: 

Theorem 2. Let u = S{X x 2 v x 3 w, A u ) ; v = S(X Xi u x 3 w, A v ), andw = S(X x x u x 2 v, 
where S(-,X) = sign(-)(| • | — A) + is the soft-thresholding operator. Then, the factor-wise so- 
lutions to the Sparse CP-TPA problem are given by: 



u 



u 2 



|u|| 2 > 



otherwise, 



v a 



Ivllo > 



otherwise. 



tt^tt ||w|| 2 >0 
||w|| 2 11 l|Z 

otherwise, 



Each factor-wise update monotonically increases the objective and when iterated, they con- 
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verge to a local maximum of 

Thus, Q can be solved a factor at a time by soft-thresholding and then re-scaling the 
update. While this type of algorithmic scheme may seem familiar to some, a major contri- 
bution of our work is determining the underlying optimization problem being solved by this 
algorithmic structure. 

There are many mathematical and computational advantages to this approach. First, 
notice that Q is a concave optimization problem in each factor with the other factors fixed. 
Thus, it is a multi-way concave optimization problem, meaning that we can solve for one fac- 
tor at a time with the others fixed. This scheme yields a monotonic algorithm that converges 
to a local maximum of Q (Tseng, 2001). This is important as, unlike our algorithmic based 



Sparse HOPCA methods, the progress of the algorithm can be tracked and convergence is 
assured. Second, notice that in Q, the scale of the factors is directly controlled. Even 
with the relaxed constraints, the solution for each factor is guaranteed to either have norm 
one or be set to zero. This is a major computational advantage as it leads to numerically 
well-conditioned algorithmic updates and solutions. Third, this method is computationally 
attractive as each factor-wise update has a simple analytical solution, and thus each update 
is computationally inexpensive. Additionally, this scheme requires far less computer memory 
than our algorithmic Sparse HOPCA approaches as only factors comprising the final solution 
need to be computed, and the tensor never needs to be matricized. The final advantage of 
this approach is its generality; as we will show in the next section, there are many possible 
extensions of this methodology. 

To compute multiple factors, we employ a deflation approach as in the Tensor Power 
Algorithm. Specifically, each factor is calculated by solving the single-factor CP problem, 
Q, for the residuals from the previously computed single-factor solutions. Notice that we do 
not enforce orthogonality in the factors. In fact, many have advocated against enforcing or- 



thogonality of sparse PCs (Zou et al. , 2006; Shen and Huang, 2008; Journee et al. , 2010). For 



factors with A = 0, however, if orthogonality is desired, this can be accomplished by altering 



the factor updates as described in Section |4.1| Thus, we have developed a deflation ap- 
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proach to Sparse HOPCA by greedily solving single-factor optimization problems related to 
the CP criterion. This Sparse CP-TPA method enjoys mathematical advantages such as as- 
sured convergence to a local optimum and is computationally attractive for high-dimensional 
tensors. 



4.2.1 Selecting Dimension and Regularization Parameters 



Another important item of consideration for our Sparse CP-TPA method and all of our 
Sparse HOPCA methods is the number of factors, K, to compute for a given data sets. 



Several have proposed heuristic methods for choosing K in HOPCA (see Kolda and Bader 



(2009) and Kroonenberg (2008) and references therein). While several non-heuristic methods 



exist for PCA on matrix data (Buja and Eyuboglu, 1992 Owen and Perry, 2009), extending 



these to the tensor framework is non-trivial and beyond the scope of this paper. 

Our Sparse CP-TPA problem has three regularization parameters that control the amount 
of sparsity in the factors. Several methods exist for selecting these regularization parameters 



in the Sparse PCA literature (Troyanskaya et al. 2001 Owen and Perry, 2009; Shen and 



Huang 


2008 


Lee et al. , 


2010) 



tensors, we choose to select regularization parameters via the Bayesian Information Criterion 



(BIC) ( |Lee et al.[ [20T0t |Allen et al.[ |20lT| ): A* = argmin Au BIC(X U ) where BIC(X U ) = 
log ^ — d "°J fOW ^ F j + l ° g ipg 9 ^ l{ u }|i where |{u}| is the number of non-zero elements of u. 
This BIC formulation can be derived by considering that each update of the Sparse CP- 
TPA algorithm solves an £i-norm penalized regression problem. Selection criteria for v 
and w are analogous. Similar BIC-based selection methods can be derived for our previous 
algorithmic Sparse HOPCA methods as well. Experimental results evaluating the efficacy of 



this regularization parameter selection method are given in Section 6.1 
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5 Extensions 

As our frameworks introduced for Sparse HOPCA are general, there are many possible ex- 
tensions of our methodology. In this section, we outline novel results and methodology for 
three extensions we consider most important: general penalties and non-negativity, gener- 
alizations for structured tensors, and multi-way functional HOPCA. Our intent here is to 
refrain from fully describing each development, but instead to give the reader enough details 
to understand and use each of these extensions. 



5.1 General Penalties & Non-negativity 

In certain applications, one may wish to regularize tensor factors with a penalty other than an 
£i-norm. Consider the following optimization problem which incorporates general penalties, 
P u (), P v () and P w (): 



maximize Afx 1 ux 2 vx 3 w — A u P u (u) — A v P v (v) — A w P w (w) 

u,v,w 



subject to u T u < l,v J v < 1, & w w < 1. 



T . 



(5) 



An extension of a result in Allen et al. (2011) reveals that one may solve this optimization 



problem for general penalties that are convex and order one by solving penalized regression 
problems: 

Theorem 3. Let P u (), P v () and P w () be convex and homogeneous or order one. Consider 
the following penalized regression problems: u = argmin u {^\ \ X x 2 vx 3 w — u 1 1| + A u P u (u)} ; 
v = argmin v {^\\ X X!ux 3 w — v | || + A V P v (v)} ; andw = argmm w {||| X XiUXjv-w H2 + 
A w Pw(w)}. Then, the block coordinate-wise solutions for (J5]) are given by: 



u 



|u|| 2 > 



otherwise 



|v|| 2 > 



otherwise 



,&w* 



Iwllo > 



otherwise 



Iteratively updating these factors converges to a local optimum of ^ . 
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An example of a possible penalty type of interest in many tensor applications is the group 



lasso, which encourages sparsity in groups of variables (Yuan and Lin, 2006) 



Much attention in the literature has been given to the non-negative and sparse non- 



negative tensor decompositions (Hazan et al. 2005 Shashua and Hazan, 2005 M0rup et al. 



2008 Cichocki et al. , 2009; Liu et al. , 2012). These techniques have been used for multi- 



way clustering of tensor data. A simple modification of Theorem [3] allows us to solve Q 
when non-negativity constraints are added for each factor: Replace the soft-thresholding 
function S(x,p) = sign(x)(|x|— p) + with the positive-thresholding function P(x, p) = (x— p) + 



(Allen and Maletic-Savatic 2011). This, then, is a computationally attractive alternative 



to estimating sparse non-negative tensor factors. Extensions of our deflation approach for 
Sparse CP-TPA allow for regularized HOPCA with general penalties and non-negativity 
constraints. 



5.2 Generalized HOPCA 



For structured data, Allen et al. (2011) showed that generalizing PCA by working with an 



alternative matrix norm capturing the known structure can dramatically improve results. 
The same techniques generalizing PCA can be used to generalized HOPCA for structured 
tensors. Examples of these include image data as seen in neuroimaging, microscopy and 
hyper-spectral imaging as well as multi-dimensional NMR spectroscopy, remote sensing, 
and environmetrics. Here, we illustrate how to extend Generalized PCA to the framework 
introduced for Sparse CP-TPA. Generalized Tucker, Sparse HOSVD, and Sparse HOOI 
methods £1X6 Su straightforward extension of these approaches. 



The Generalized PCA optimization problem of Allen et al. (2011) replaces minimizing a 



Frobenius norm with a transposable-quadratic norm comprised of two quadratic operators 
in the row and column space of the data matrix. To generalize a three-way Frobenius norm, 
let Q (1) G 3? nxn , Q (2) G 9£ pxp and Q {3) G $t qxq and define the three-way quadratic norm as the 
following: || X I Iq(i),qC2),q(3) = [YZ=i Y%=1 Y$=i Y?j>=i Yl=i YX>=i Qu> Qf) Qifc' XijkXi'j'V 
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1 /2 

(ELi EJ=i ELi *ijk) > where * = X xi Q (1) X2 Q (2) x 3 Q (3) - A ma J° r assump- 
tion implied by this three-way quadratic norm is separability of the structured errors along 
each of the tensor modes. In fact, one can show that this norm implies a three-way Kronecker 



covariance structure related to the array-normal distribution (Hoff, 2011). Give this, we de- 
fine the rank-one Generalized CP decomposition as the solution to the following optimization 
problem: 



1 



minimize 



X -(iuov OW || 2 m 



-H^-uuwvuw (1) (2) (3) 
u,v,w,d 2 W ,W >W 

subject to u T Q (1) u = 1, v T Q (2) v = 1, w T Q (3) w = 1, h d > 0. (6) 

In the same manner in which we motivated the Sparse CP-TPA optimization problem, we 
can define the rank-one Sparse Generalized CP-TPA decomposition to be the solution to the 
following: 

maximize X Xi u x 2 v x 3 w — A u || u ||i — A v || v ||i — A w || w ||i 

u,v,w, 

subject to u r Q (1) u < 1, v T Q (2) v < 1, & w r Q {3) w < 1. (7) 



Following from results in (Allen et al. 2011), these rank-one problems have closed form 
solutions. 

Proposition 3. The coordinate-wise updates for u are: 

1. Generalized CP: u* = (X x 2 Q (2) v x 3 Q (3) w)/||(AT x 2 Q (2) v x 3 Q (3) w)| | q( d . 



2. Sparse Generalized CP: Define u = argmin u {^\\ X x 2 Q 1 " 2 ' v x 3 Q*- 3 -* w — u ||q(i) + 
A u || u ||i}, then u* = u/||u||q(i) if ||u||q(i) > 0, and u* = otherwise. 

The updates for v and w are analogous. Together these updates converge to a local solution 
to the Generalized CP or Sparse Generalized CP problem. 

As with our TPA and Sparse CP-TPA methods, subsequent components can be calculated 
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via deflation. Thus, we have outlined a generalization of our results on HOPCA and Sparse 
HOPCA for use with structured tensor data. 



5.3 Multi-way Functional PCA 

Many examples of high-dimensional tensors have functional valued elements. Examples 
include multi-dimensional spectroscopy data commonly studied in chemometrics, series of 
images measured over time such as in neuroimaging, and hyper-spectral imaging data. For 
matrix data, many have used Functional PCA (FPCA) to reduced the dimension and dis- 



cover patterns in functional valued data (Silverman, 1996). Recently, Huang et al. (2009) 



extended the Functional PCA framework to encompass two-way functional data. These same 
techniques can be further extended to multi-way data to study the examples of functional 
tensor data described above. 



Huang et al. (2009) elegantly showed that estimating functional principal components by 



solving a bi-convex optimization problem and using deflation is equivalent to two-way half- 



smoothing the data, an extension of the FPCA half-smoothing result of Silverman (1996). 
Both the optimization approach and half-smoothing approach can be extended for multi- 
way tensor data. For tensors, however, these two approaches do not yield equivalent tensor 
factorizations. We briefly summarize these two approaches and their resulting properties. 
Consider functional data that has been discretized, yielding the tensor, X e gfj™xp Xi ? T3 e g ne 
S u = I(n) + ^u? S v = I( p )+O v , and S w = I( g )+f2 w , where f2 is a smoothing matrix commonly 



used in FPCA; examples include the squared second or fourth differences matrices (Ramsay 



2006). Then, we can define Tensor FPCA via half-smoothing as follows: (1) Half smooth the 
data, X <r- X XxS^T 172 x 2 S v ~ 1/2 x 3 S w ~ 1/2 ; (2) Take the Tucker decomposition of X « 
T> Xi U x 2 V x 3 W; (3) and half-smooth the components, <— S u _1 ^ 2 Ufc, <— S v ~^ 2 Vfc, 
and w fe -c- S w -1 ^ 2 Wfe. We can also define the rank-one Tensor FPCA as the solution to the 
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following tri-convex optimization problem: 



minimize 1 1 X — u o v o w | \ 2 F + u T S u u v T S v v w T S w w — 1| u | \\\ | v | \\\ | w | \\. (8) 

u.v.vv 

Subsequent components can be calculated via deflation. 

Proposition 4. 1. The coordinate-wise updates converging to a local minimum of ^ 
are given by: u = S u _1 (X x 2 v x 3 w) / (v T S v v w T S w w), with the updates for v and 
w are defined analogously. 

2. The solution to ^ is not equivalent to Tensor FPCA via half -smoothing. In other 
words, the latter is not necessarily a local minimum to ^ . 

Although half-smoothing and the solution to ^ are not equivalent, they both solve their 
respective optimization problems and thus the problem of non-convergence as with Sparse 
HOOI is avoided. Thus, methods for Tensor FPCA are another possible extension of our 
frameworks developed for Sparse HOPCA. 

6 Results 

We compare the performance of the various methods introduced for HOPCA and Sparse 
HOPCA in terms of signal recovery and feature selection on simulated data and in terms of 
dimension reduction and feature selection on a microarray and functional MRI example. 

6.1 Simulation Studies 

We evaluate the comparative performance of our methods on a simulated low rank three-way 
tensor model comprised of sparse rank-one factors. All data is simulated from the following 
model: X = Xlf=i 4 u fc ov it ow fc + ^, where the factors u^,Vfe, and w/, are random, dk is 
fixed and Sij t i ~ N(0, 1). Four scenarios are simulated and summarized as follows. Scenario 
1: 100 x 100 x 100 with U sparse; Scenario 2: 1000 x 20 x 20 with U sparse; Scenario 3: 
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Fi gure 1: ROC curves averaged over 50 replicates for each of the four simulation scenarios described in 
Section \6.1\ The Sparse CP-TPA method performs similarly to Sparse HOSVD and Sparse HOOI in all 
but Scenario 3 where the later two have slight advantages. The Sparse CP-ALS method, on the other hand, 
exhibits poor recovery of the true features, with performance often comparable to naive thresholding. 

100 x 100 x 100 with U, V and W sparse; and Scenario 4: 1000 x 20 x 20 with U, V 
and W sparse. Sparse factors are simulated with 50% randomly selected elements set to 
zero and non-zero values are i.i.d. N(0, 1). Non-sparse factors are simulated as the first K 
left and right singular vectors of a data matrix with i.i.d. N(0, 1) entries. In simulations 
where K = 1, di — 100. In simulations where K = 2, di = 200 and c?2 — 100. Additional 
simulation results for these four scenarios with various signal to noise levels and a differing 
percentages of sparse elements are given in the Supplemental Materials. 

First, we test the accuracy of our methods in selecting the correct non-zero features. 
Receiver-operator curves (ROC) averaged over fifty replicates computed by varying the reg- 
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Scenario 1 




CP 


Tucker 


Sparse CP-ALS 


Sparse CP-TPA 


Sparse HOSVD 


Sparse HOOI 


TP / FP ui 






0.8308 / 0.8108 


0.9332 / 0.0568 
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TP / FP u 2 
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0.8688 / 0.0324 
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0.8704 / 0.0304 


Signal Recovery X 


1.0052 


1.0053 


0.9993 
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Scenario 2 




CP 


Tucker 


Sparse CP-ALS 


Sparse CP-TPA 


Sparse HOSVD 


Sparse HOOI 
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0.7373 / 0.0329 


0.6954 / 0.0066 


0.7805 / 0.0197 


Signal Recovery X 


1.0068 


1.0032 


0.9981 
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Scenario 3 




CP 


Tucker 


Sparse CP-ALS 


Sparse CP-TPA 


Sparse HOSVD 
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0.9412/ 0.1696 
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0.0496 




Scenario 4 




CP 


Tucker 


Sparse CP-ALS 


Sparse CP-TPA 


Sparse HOSVD 
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0.8972 / 0.1618 


0.8617 / 0.0256 


0.8876 / 0.0184 
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0.7986 / 0.1455 


0.7083 / 0.0067 
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0.9400 / 0.0000 
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0.9400 / 0.4720 


0.9080 / 0.1880 


0.9700 / 0.5620 


0.9520 / 0.1360 


TP / FP wi 






0.9840 / 0.4560 


0.9260 / 0.0620 
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0.9120 / 0.0000 


TP / FP w 2 






0.9360 / 0.4800 


0.9000 / 0.1640 


0.9660 / 0.6140 


0.9380 / 0.1380 


Signal Recovery X 


1.0090 


1.0025 


0.9997 


0.1252 


0.1238 


0.1234 



Table 1: True and false positive rates (TP and FP) and signal recovery measured in mean squared error 
for the four simulation scenarios described in Section \6.1\ averaged over 50 replicates. The Sparse CP-TPA 
method performs similarly to Sparse HOSVD and Sparse HOOI, with the later having a slight advantage in 
some scenarios. The Sparse CP-ALS method performs poorly both in terms of feature selection and signal 
recovery. 



ularization parameter, A, are given for each of the four scenarios in Figure [Tj We compare 
the Sparse HOSVD, Sparse HOOI, Sparse CP-ALS, and Sparse CP-TPA to naive threshold- 
ing of the CP and Tucker decompositions that act as a baseline. In this paper, we implement 



the Sparse HOSVD and Sparse HOOI using the Sparse PCA method of Shen and Huang 



(2008). From these comparisons, we see that the Sparse HOSVD and Sparse HOOI consis- 
tently perform the best across all four simulation scenarios. The Sparse CP-TPA method 
performs equally well in all but Scenario 3 where all factors are sparse and the samples sizes 
are equal. The Sparse CP-ALS method has considerably worse performance and barely bests 
the naive thresholding methods. 



Next in Table 6.1, we compare the performance of our methods in terms of feature 
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100 x 100 x 100 


1000 x 20 x 20 


250 x 250 x 250 


5000 x 50 x 50 


Sparse CP-ALS 


0.075 


0.030 


1.665 


3.240 


Sparse CP-TPA 


0.090 


0.030 


1.800 


5.335 


Sparse HOSVD 


0.720 


0.330 


13.355 


24.300 


Sparse HOOI 


0.835 


0.430 


14.230 


26.950 



Table 2: Median time in seconds over ten replicates for convergence of each algorithm for a fixed value of 
the regularization parameter. A rank-one model in which each factor is 50% sparse is employed. The Sparse 
CP methods are markedly faster than the Sparse Tucker methods. 

selection and low rank signal recovery, measured in mean squared error, at one value of the 
regularization parameter, A. To be consistent, the BIC method was used to select A for 
all methods. Again, the Sparse HOOI is the best performing method with the Sparse CP- 
TPA method performing similarly in all but Scenario 3. Simulation results under different 
percentages of sparsity and different signal levels presented in the Supplemental Materials 
exhibit similar behavior. 

Finally, we compare the time until convergence for each of our methods for Sparse 
HOPCA in Table [2] Note that while convergence cannot be mathematically guaranteed for 
the methods discussed in Section |3j in practice effective use of warm starts and constraints 
on the change in factors permitted in each iteration often yield convergent algorithms. Tim- 
ings were carried out on a Intel Xeon X5680 3.33Ghz processor and methods were coded as 



single-thread processes run in Matlab utilizing the Tensor Toolbox (Bader and Kolda, 2010). 
From these timing results, we see that the Sparse CP methods are considerably faster than 
those based on the Tucker decomposition. While the Sparse CP-ALS method is fastest, it 
also has the worst performance. Overall, in addition to having nice mathematical properties, 
the Sparse CP-TPA method offers a good compromise between fast computation and strong 
performance results in terms of feature selection and signal recovery. 



6.2 AGEMAP Microarray Example 

We analyze the high-dimensional AGEMAP microarray data set (publicly available from 
http : //www. grc .nia.nih. gov/branches/rrb/dna/ agemap_data.htm ) with Sparse HOPCA. 
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Figure 2: 

data (left) 



Cumulative proportion of variance explained by third order PCs on the AGEMAP microarray 
and the StarPlus fMRI data (right). 



This data consists of gene expression measurements for 8,932 genes measured for 16 tissue 



types on 40 mice of ages 1, 6, 16, or 24 months (Zahn et al. , 2007). As measurements 



for several mice are missing for various tissues, we eliminate any tensor slices that are en- 
tirely missing, yielding a data array of dimension 8932 x 16 x 22. Scientists seek to discover 
relationships between tissue types of aging mice and the subset of genomic patterns that 
contribute to these relationships. These patterns in each tensor mode cannot be found by 
simply applying PCA or Sparse PCA to the fattened tensor. 

The CP-TPA, CP-ALS, and Tucker decompositions as well as Sparse CP-TPA were ap- 
plied to this data to reduce the dimension and understand patterns among tissues and genes. 
In the left panel of Figure |2j the cumulative proportion of variance explained by the first 
eight components is given. Notice that the CP-ALS, CP-TPA, and Tucker decompositions 
explain roughly the same proportion of variance with eight components, but the CP-TPA 
and Tucker decomposition explain much more initial variance in the first several PCs. As 
often scientists are only interested in the first couple principal components, this illustrates an 
important advantage of our CP-TPA and Sparse CP-TPA approach as compared to CP-ALS 
in the analysis of real data. 

In Figure |3j we explore patterns found in the AGEMAP data via Sparse HOPCA. The 
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results were computed using the Sparse CP-TPA method placing a penalty on the gene mode 
with the BIC used to select A. In the top panel, we show scatterplots of the first six PCs 
for the tissue mode. We see many clusters of tissue types for the various pairs of PCs. For 
example, adrenal, gonads and bones often cluster together. As only a subset of genes are 
selected by each of the PCs, we can analyze the genetic patterns further for each tissue type. 
Gonads has a higher value for the second principal component, for example, so we display 
a cluster heatmap of the 1,439 genes selected in PC2 for this tissue type in the lower left 
panel of Figure [3j We see that the genes selected by this sparse PC perfectly separate the 
male and female mice. As liver has a lower PC value for the third tissue component, we 
display the cluster heatmap for the 514 genes selected by this component for liver in the 
lower right panel of Figure 3. Again, we see that this component clusters the mice well 
according to their ages. Further plots and analyses of this type reveal subsets of important 
genes associated with various tissues and mice ages or gender. This type of multi-mode 
analysis is an important advantage of applying Sparse HOPCA as opposed to Sparse PCA 
on a flattened tensor. 



6.3 StarPlus fMRI Example 

As functional MRIs are a common source of high- dimensional tensor data, we apply our meth- 
ods to understand patterns for subject 04847 of the StarPlus fMRI experiment (publicly avail- 



able from http : //www . cs . emu. edu/af s/cs . emu. edu/project/theo-81/www/) (Mitchell et al 



2004). This data set consists of 4,698 voxels or spatial locations in the brain sampled on a 
64 x 64 x 8 grid, measured over 54 - 55 time points for each of 40 tasks. In each task, the 
subject was shown an image and read a sentence. The sentence either explained the image, 
for example an image of a star with a plus above paired with the sentence, "The plus is 
above the star.", or in which the sentence negated the image. We analyze data for each of 
the 36 tasks lasting for 55 time points, yielding a tensor array of dimension 4, 698 x 55 x 36. 
We apply HOPCA, Sparse HOPCA, Generalized CP, and Sparse Generalized CP-TPA 
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Fi gure 3: Analysis of AGEMAP Microarray data via Sparse Higher- Order PC A. (Top Panel) Scatterplots 
of the first eight principal components for the 16 tissue types. (Lower Left) A cluster heatmap of the genes 
selected by sparse PC 3 for the tissue Gonads labeled by gender. (Lower Right) A cluster heatmap of the 
genes selected by sparse PC 8 for the tissue Heart labeled by age in months. 



methods to understand the spatial, task and temporal patterns in the fMRI data (M0rup 



et al. 



2008). The Generalized HOPCA methods were applied with a graph Laplacian 
of the nearest neighbor graph connecting the voxels, a kernel smoother over the 55 

time points and = I. These quadratic operators were selected such that the variance 



explained by the first GPC was maximal as described in Allen et al. (2011). Sparsity was 
incorporated into the spatial component to select relevant regions of interest. In the right 
panel of Figure |2j we present the cumulative proportion of variance explained by each of 
the methods. Generalized HOPCA methods strikingly explain a much great proportion of 
variance as they incorporate known spatio-temporal structure into the tensor factorization. 
In Figure |4| we compare the first two HOPCs of the Tucker decomposition to those of the 
Sparse Generalized CP-TPA method. The latter yield interpretable temporal patterns and 
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discrete regions of interest related to the tasks. The temporal patterns of the Tucker de- 
composition appear to be confounded by sinusoidal noise with noisy, uninterpretable spatial 
components. 

7 Discussion 

We have developed methodology for regularizing HOPCA in the context of high-dimensional 
tensor data. Beginning with algorithmic approaches, we presented Sparse CP-ALS, Sparse 
HOSVD, and Sparse HOOI methods based on popular algorithms for the CP and Tucker 
decompositions. While these methods are intuitive, they fail to solve a coherent optimiza- 
tion problem, and are thus mathematically and computationally less desirable. Next, we 
develop a greedy framework for computing HOPCs based on deflation and show that a sim- 
ple multi-way concave relaxation of this optimization problem leads to an approach we term 
Sparse CP-TPA. As this method converges to a local solution of a well-defined optimization 
problem, it enjoys many mathematical and computational advantages. Finally, we show how 
the deflation approach to HOPCA can be extended to work with general penalties and con- 
straints, and with structured or functional tensor data. A major strength of our approach is 
its flexibility to model different types of high-dimensional tensor data, a quality illustrated 
on our examples to microarray and fMRI data. 

There are several items related to our work to study further. First, we note that while 
we have presented all of our methods with three-mode tensors for notational convenience, 
extensions to tensors with an arbitrary number of modes is trivial. The extensions of our 
Sparse HOPCA frameworks to encompass general penalties and constraints, structured ten- 
sors, and multi-way functional data outlined in Section [5] deserve further investigation and 
development. Simulation studies and comparisons to existing approaches are needed to fully 
evaluate the utility of these methods. 

Two important items related to our regularized HOPCA frameworks are beyond the 
scope of this paper and require further investigation. Determining how many HOPCs to 
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Fi gure 4: Depictions of the first two task PCs, time course PCs and spatial PCs for the Tucker decom- 
position (top panel) and Sparse Generalized CP-TP A (bottom panel) for the StarPlus fMRI data. In the 
scatterplots, red denotes the sentence negating the image while blue denotes sentence and image agreement. 



27 



extract for a given data set is an important problem. For PCA on matrix data, there are 



several data-driven methods available (Buja and Eyuboglu, 1992 Owen and Perry, 2009), 



but extending these approaches to the tensor domain is a non-trivial task. Our result on the 
amount of variance explained by each of the HOPCs or Sparse HOPCs, Theorem [TJ gives 
users of our methodology a way to gage the dimension reduction achieved, but further work 
needs to be done to determine the optimal number of factors. Secondly, results for every 
regularized HOPCA method, and in fact all methods for HOPCA, depend heavily on initial 



algorithmic starting values (Kroonenberg, 2008). This occurs as all HOPCA and regularized 



HOPCA methods find at best a local optimum. With Sparse PCA methods on matrix data, 
the same problem occurs but is less critical as one can initialize algorithms to the SVD which 
is a global solution. With tensor factorizations, further work is required to determine the 
best initializations for our regularized HOPCA algorithms. 

As the statistical attention given to high-dimensional tensor data has been limited, topics 
for further study abound. Recently, many have made progress on understanding the asymp- 



totic properties of PCA and Sparse PCA by using random matrix theory (Johnstone and 



Lu, 2009; Jung and Marron, 2009 Amini and Wainwright, 2009). As yet, there is no work 



on developing consistency results for HOPCA which would serve as a precursor to studying 
consistency for the methods presented in this paper. Additionally, our framework for regu- 
larized HOPCA is a foundation upon which to study other multivariate analysis techniques 
for tensor data. These include methods for clustering, canonical correlations analysis, partial 
least squares, discriminant analysis, and multi-dimensional scaling. Variable selection in the 
context of regression and classification for high-dimensional matrix data has been a topic of 
great interest recently. One can imagine that variable selection for high-dimensional tensors 
may be more critical in these supervised prediction methods. 

Finally, there are a plethora of possible applications of our methodology. Technologies 
in biomedical imaging are producing massive multi-way data sets that are a challenge to 
understand and analyze. Examples of these are especially common in neuroimaging, radiol- 
ogy, and chemometrics. Other imaging data sets such as from hyper-spectral cameras and 
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remote sensing often contain three or more measured dimensions. Additional examples of 
high-dimensional tensor data come from environmental and climate studies, bibliometrics, 
and network reliability. Overall, our methods for regularized HOPCA are foundational tools 
for understanding high-dimensional tensors that will enjoy wide-ranging applicability. 
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A Algorithms for HOPCA & Sparse HOPCA 



Algorithm 1 Sparse Higher-Order SVD (Sparse HOSVD) 



1. 


U «- 


- First K\ sparse principal components of Xm. 


2. 


V <- 


- First K% sparse principal components of X( 2 ). 


3. 


W i 


— First K3 sparse principal components of X( 3 ). 


4. 


T> <- 


- X xiU x 2 V x 3 W. 



B Additional Simulation Results 



C Proofs 



Proof of Theorem [7| The proof is an extension of a result in Shen and Huang ( 2008 ) . Recall 
that the cumulative proportion of variance explained by the first k traditional principal 
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Algorithm 2 Sparse Higher-Order Orthogonal Iteration (Sparse HOOI), or Sparse Tucker- 
ALS 



1. Initialize U, V, and W using the Sparse HOSVD algorithm. 

2. Repeat until convergence or maximum number of iterations reached: 

(a) U <— First K\ sparse principal components of {X X2 V X3 W)m. 

(b) V «— First K2 sparse principal components of [X Xi U X3 W)( 2 ). 

(c) W «— First K3 sparse principal components of (X XiU x 2 V)( 3 ). 

3. T> <- X X1UX2VX3W 



Algorithm 3 Sparse CP-ALS Algorithm 

1. Initialize U (0) , V {0) and W (0) . 

2. Repeat until convergence or maximum number of iterations reached: 

(a) Estimate U (m) : 

i. U «- arguing { \\ | X (1) - U(V W W (t) ) T | \% + X v \ | U | |i } . 

ii. dk <— ||ufc||. 

111. <- u fc /4. 

(b) Estimate V (m) : 

i. V <- argmin v [ \\ | X (2) - V(U (m) W*" ) T | \% + A v 1 1 V 1 1 1 } . 

ii. d k <- ||v fc ||. 

iii. v^ +1) <- v fc /d fc . 

(c) Estimate W (m) : 

i. W <- argmin w [ || | X (3) - W(U (m) V( t+1 )) T | || + A w | | W ] d} . 

ii. d k <r- 1 1 w fc j J . 

iii. w^ +1) <- w fe /4. 
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Algorithm 4 Tensor Power Algorithm 



1. Initialize X = X. 

2. For k — 1...K 

(a) Repeat until converge: 

i. life <- X x 2 v fc x 3 w fc / 1 1 AT x 2 v fc X 3 Wfc || 2 . 

ii. v fc <- X x 1 u k x 3 w fc / \\X x x u fc x 3 w fe || 2 . 

iii. w fe <- ^ x x u k x 2 v fc / \\X Xi u fc x 2 v fe || 2 . 

(b) 4 «- AT Xi u fc x 2 v fc x 3 w fe . 

(c) X 4- X - d k u fc o v fc o w fc . 

Algorithm 5 Sparse CP-TPA 

1. Initialize X = X . 

2. For k = 1 ... K 

(a) Repeat until converge: 

i. Ufc = S (x x 2 Vfc x 3 Wfc, A u 

ii. Vfc = S {x xi Ufc x 3 Wfc, A v 

iii. Wfc = 5 ^ x x Ufc x 2 Vfc, A w 

(b) 4 A' xi Ufc x 2 Vfc x 3 Wfc. 

(c) X <- X - dfcUfcO Vfc o wfc. 




||ufc|| 2 > 
otherwise. 

I|vfe||2 > 
otherwise. 

||wfc|| 2 > 
otherwise. 



components can be written as the ratio of the squared Frobenius norm of the data projected 
onto the first k left and right singular vectors to the squared Frobenius norm of the data 



(Jolliffe and MyiLibrary, 2002). Notice that P k U \ an< ^ are projection matrices with 



exactly k eigenvalues equal to one. Therefore, the result is an extension of the definition of 
cumulative proportion of variance explained to the tensor framework. Namely, the numerator 
is simply the projection of the tensor onto the subspace spanned by the first k HOPCA 
factors. □ 

Proof of Proposition [7j We will show that the updates for U in the Sparse CP-ALS algorithm 
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are not equivalent to the optimal point for U from (11]). Arguments for V and W are 
analogous. First, note that the problem Q with respect to U can be rewritten using the 
Khatri-Rao product as minimizeu ||| X(x) — U(V W) T |||i + A u || U ||i subject to u[u k < 
1 V k = 1, . . . K. To simplify notation, we will write H = V W and Y = Xm. As this 
problem is convex in U, the KKT conditions are necessary and sufficient for optimality. 
These conditions include the sub-gradient equation: 

Y H T — U* H H T + A u r(U*) — 2\&* U* = 0, (9) 

where T() is the sub-gradient of the ^-norm and \&* is the Lagrange multiplier, namely a 
diagonal matrix that from complimentary slackness is strictly non-zero, ty* >- if and only 
if the columns of U* have norm one. 

Now consider the sub-gradient equations implied by the Sparse CP-ALS updates for U 
before scaling the solution U to have column norm one: 

Y H T — U H H T + A u r(U) = 0. 

We can represent scaling the columns of U as right multiplying by a diagonal matrix: UD. 
Since T() is order one, then r(UD) = r(U) and the above sub-gradient equation is equal 
to: 

YH T -(U D) D 1 H H T + A u T(U D) = 0, Y H r -(U) D 1 H H T + A u T(U) = 0, 

where U = UD. Now, for the Sparse CP-ALS update to solve ([TJ, the above sub-gradient 
equation must be equivalent to ^ for some diagonal \&* >~ and some diagonal D >~ 0. This 
means that U*(HH T +2$*) = U D 1 H H T . Assuming that U = U* and solving for 
we have that \I>* = HH T (D _1 — 1)/2. Thus, ^* is diagonal if and only if HH T is diagonal. 
This is a contradiction. Therefore, the updates from the Sparse CP-ALS algorithm do not 
solve (JTJ) . □ 
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Proof of Proposition [S| Consider optimizing ^ with respect to u. The Lagrangian is given 
by L(u, 7) = (AT X2VX3W) Xi u— 7(u T u — 1). The KKT conditions imply that u* = 



— X |^» X3 w and 7* is such that (u*) T u* = 1. Putting these together we have the desired 
result for u. The arguments for v and w are analogous. □ 



Proof of Theorem^ The proof follows from an extension of results in Witten et al. (2009) 



and Allen et al. (2011). In short, consider optimizing Q with respect to u. The KKT condi- 



tions imply that X x 2 v x 3 w — p u T(u*) —27* u* = and 7*((u*) T u* —1) = where T(u) is 
the subgradient of 1 1 u | |i, and 7 is a Lagrange multiplier. Consider u = S(X x 2 v x 3 w, p u ). 
Then, taking u* = u/||u|| 2 and 7* = ||u|| 2 /2 simultaneously satisfies the KKT conditions. 
Since the problem is convex in u, the conditions are necessary and sufficient; hence, the pair 
(u*,7*) are the optimal points. It is easy to verify that the pair (0,0) also satisfy the KKT 
conditions and are optimal points. □ 



Proof of Theorem^ The proof follows from an extension of results in Allen et al. (2011 ). Let 
us briefly consider the case for u and the arguments for v and w are analogous. We will show 
that the optimal points, (u*, v*), implied by (jsj) are equivalent to the said solution. The sub- 
gradient equation for ^ is: X x 2 v x 3 w —27* u* — A u VP u (u*) = 0, where VP U () is the sub- 
gradient of P u (). Now, consider the subgradient equation of minimize u ||| X x 2 v x 3 w — u|||+ 
A u P u (u) and for some c > 0: 

= X x 2 x 3 w-u- A u VP u (u) = X x 2 x 3 w — (cu) - A u VP u (u) = X x 2 x 3 w-u/c- A u VP u (u), 

c 

where u = cu. Since P u () is order one, VP u (x) = VP u (cx) Vc > 0. Taking c = l/||u|| 2 
and letting 7* = w- = ||u|| 2 /2, we see that for the pair (u* = u/||u|| 2 ,7* = ||u|| 2 /2) 
the subgradient equation of the penalized regression problem is equivalent to that of ([5]). 
Following from complimentary slackness, 7* = if and only if u = 0, and hence the pair 
(0, 0) also satisfy the KKT conditions of ([5]). We have thus proven the desired result. □ 

Proof of Proposition [3| The proof for part one follows from an extension of Proposition 
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2 to the generalized eigenvalue shown in flAllen et al. 2011b. In brief, the gra- 

dient equation of ([6} is X x x Q (1) x 2 Q (2) v x 3 Q (3) w -27 Q (1) u = meaning that u* = 
X x 2 Q (2) v x 3 Q (3) w /27* where 7* is such that u T Q (1 ^ u* = 1. Putting these together we 
have the desired result. 



The argument for part two follows from an extension of Theorem [3] and Allen et al. 



(2011). For completeness, we show that the subgradient of ^ with respect to u, 
X XxQ (1) x 2 Q (2) v x 3 Q (3) w-27Q (1) u- A u T(u) = where T(-) is the sub-gradient of 
is equivalent to the Sparse Generalized CP update given in Proposition |3j Let y = 
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X x 2 Q*- 2 " 1 v x 3 w and let u be the argument minimizing ||| y — u ||q(i) + A u || u ||i. Then, 
for some c > 0, = Q (1) y - Q {1) u - A u T(u) = Q (1) y - Q (1) u/c - A u T(u), where li = cu, 
since T(u) = T(cu) Vc > 0. Taking 7* = l/2c = ||u||q(i)/2 we see that the for both pairs 
(u* = u/||u||q,7* = ||u|| q (d/2) and (u* = 0,7* = 0) satisfy the KKT conditions of Q, 
thus proving the desired result. □ 

Proof of Proposition^ For part 1, consider the gradient equation of ^ with respect to u: 
^ = -IX x 2 v x 3 w+2uv T vw T w -2u || v ||||| w ||| + 2S u uv T S v v w T S w w = 0. It is 
then easy to see that the coordinate update is u* = Sy 1 ^x 2 vx 3 w/ (v T S v v w T S w w). 



For part 2, let us review the two-way matrix case proved in Huang et al. (2009). They 
show that two-way FPCA via half-smoothing implies that u oc S u x Xv and v oc S^ 1 X T u 
which are the stationary points of the two gradient equations for the penalized optimization 
problem. For our tensor case, the gradient equations of ^ imply u oc S" 1 X x 2 v x 3 w and 
so forth, but these are not satisfied by tensor half-smoothing. As the Tucker decomposition 
has a non-diagonal core, u x is proportional to the first column of S U 1 (AT x 2 V x 3 W) T>. 
Thus, the stationary points implied by half-smoothing and ([8| are not equivalent. □ 
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